SemanticScuttle - klotz.me » Tags: interpretability+large language model

Tags: interpretability* + large language model*

Interpretable Causal Diffusion Language Models

Steerling-8B is an interpretable causal diffusion language model that combines masked diffusion language modeling with concept decomposition, enabling generation, attribution, steering, and extraction of hidden representations. It offers features like block-causal attention and decomposition of hidden states into known and unknown concepts.

2026-02-24 Tags: attribution, concepts, models, decomposition, features, diffusion, interpretability, explanations, explainability, llms, generative-ai by klotz

Mechanistic Interpretability: Peeking Inside an LLM

This article explores the field of mechanistic interpretability, aiming to understand how large language models (LLMs) work internally by reverse-engineering their computations. It discusses techniques for identifying and analyzing the functions of individual neurons and circuits within these models, offering insights into their decision-making processes.

2026-02-06 Tags: llm, mechanistic interpretability, visualization, reverse engineering, neural networks, interpretability, machine learning by klotz

Meet the new biologists treating LLMs like aliens

Researchers are studying large language models as if they were living things, discovering secrets by applying biological and neurological analysis techniques. This approach is revealing unexpected behaviors and limitations of LLMs.

2026-01-13 Tags: llm, interpretability, neurology, biology, openai, san francisco by klotz

Large Language Models are Locally Linear Mappings

This paper demonstrates that the inference operations of several open-weight large language models (LLMs) can be mapped to an exactly equivalent linear system for an input sequence. It explores the use of the 'detached Jacobian' to interpret semantic concepts within LLMs and potentially steer next-token prediction.

2025-06-02 Tags: llm, interpretability, jacobian, next-token prediction, transformer models, deep learning, machine learning by klotz
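
The local-linearity idea in the entry above can be illustrated on a toy model. The following is a minimal sketch, assuming a tiny bias-free ReLU network rather than a transformer; the weights `W1` and `W2` and the helper names are hypothetical and are not taken from the paper. Once the ReLU gates are frozen at the values they take for a given input, the network's output on that input equals a single matrix applied to the input.

```python
# Toy illustration only: a bias-free two-layer ReLU network, not an LLM.
# W1, W2 and all helper names are hypothetical.

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Multiply matrix a by matrix b (both lists of rows)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 0.5]]   # 3 hidden units, 2 inputs
W2 = [[1.0, 2.0, -1.0], [0.0, 1.0, 3.0]]      # 2 outputs, 3 hidden units

def forward(x):
    """Ordinary nonlinear forward pass: W2 @ relu(W1 @ x)."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)

def local_jacobian(x):
    """Freeze the ReLU gates at input x and fold them into one matrix."""
    gate = [1.0 if h > 0 else 0.0 for h in matvec(W1, x)]
    gated_W1 = [[g * w for w in row] for g, row in zip(gate, W1)]
    return matmul(W2, gated_W1)

x = [1.0, 2.0]
J = local_jacobian(x)
linear_output = matvec(J, x)      # linear map applied to the same input
nonlinear_output = forward(x)     # matches linear_output for this x
```

The matrix `J` is only valid for the input it was computed from; a different input can flip the gates and yield a different matrix, which is why the mapping is described as locally linear.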

Mapping the latent space of Llama 3.3 70B

Sparse autoencoders (SAEs) have been trained on Llama 3.3 70B, and the resulting interpreted model has been released through an API, enabling research and product development through feature-space exploration and steering.

2024-12-25 Tags: llm, llama 3.3, sparse autoencoders, sae, latent space, features, xai, api, interpretability by klotz
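
As a rough picture of what feature-space exploration and steering with a sparse autoencoder involves, here is a minimal sketch, assuming a hand-written toy SAE with a plain ReLU encoder. The weights, dimensions, and function names are hypothetical; they do not reflect the released model or its API.

```python
# Toy sparse autoencoder with hand-set weights (hypothetical, for illustration).
# A 2-dimensional "activation" is encoded into 3 non-negative features,
# each of which has its own decoder direction.

W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # one encoder row per feature
B_ENC = [0.0, 0.0, 0.0]                          # encoder biases
W_DEC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # one decoder direction per feature

def encode(x):
    """Feature activations: relu(W_ENC @ x + B_ENC)."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(W_ENC, B_ENC)]
    return [max(0.0, p) for p in pre]

def decode(features):
    """Reconstruction: sum of decoder directions weighted by feature activations."""
    out = [0.0] * len(W_DEC[0])
    for f, direction in zip(features, W_DEC):
        for i, d in enumerate(direction):
            out[i] += f * d
    return out

def steer(x, feature_index, value):
    """Clamp one feature to a chosen value, then decode back to activation space."""
    features = encode(x)
    features[feature_index] = value
    return decode(features)

activation = [2.0, 1.0]
features = encode(activation)          # only two of three features fire
reconstruction = decode(features)
steered = steer(activation, 1, 5.0)    # push feature 1 up and decode
```

In a trained SAE the encoder and decoder are learned with a sparsity penalty so that few features fire per input; the hand-set weights here only show the encode, edit, decode loop.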

Gemma Scope | NeuronPEDIA

Gemma Scope is an open suite of sparse autoencoders trained on the Gemma 2 language models, presented here through Neuronpedia's interactive demo for exploring the features those autoencoders learn.

2024-08-02 Tags: gemma scope, gemma, llm, neuropedias, interpretability, xai, deep learning by klotz

Gemma Scope: helping the safety community shed light on the inner workings of language models

DeepMind's Gemma Scope provides researchers with tools to better understand how Gemma 2 language models work through a collection of sparse autoencoders. This helps in understanding the inner workings of these models and addressing concerns like hallucinations and potential manipulation.

2024-11-14 Tags: llm, interpretability, gemma scope, autoencoder, deepmind, visualization, xai, analysis by klotz

Refusal in LLMs is mediated by a single direction

This post discusses a study finding that refusal behavior in language models is mediated by a single direction in the model's residual stream. The study presents an intervention that bypasses refusal by ablating this direction, and shows that adding the direction back in induces refusal. The work was carried out as part of a scholars program, and a forthcoming paper is expected to provide more detail.

2024-06-10 Tags: large language model, refusal, interpretability, ai alignment, safety, fine-tuning by klotz
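
The two interventions described in the entry above, ablating a direction and adding it in, can be sketched on a single activation vector. This is a minimal sketch, assuming the direction is already known; the vector values and function names are hypothetical, and the post's actual procedure for finding the direction and applying it across a model is not reproduced here.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [c / norm for c in v]

def ablate_direction(x, direction):
    """Remove the component of x along direction: x - (x . r) r, with r a unit vector."""
    r = normalize(direction)
    coeff = dot(x, r)
    return [xi - coeff * ri for xi, ri in zip(x, r)]

def add_direction(x, direction, scale):
    """Push x along direction by a chosen amount: x + scale * r."""
    r = normalize(direction)
    return [xi + scale * ri for xi, ri in zip(x, r)]

# Hypothetical 3-dimensional residual-stream activation and direction.
activation = [2.0, -1.0, 0.5]
refusal_dir = [1.0, 0.0, 0.0]

ablated = ablate_direction(activation, refusal_dir)    # no component left along refusal_dir
steered = add_direction(activation, refusal_dir, 4.0)  # larger component along refusal_dir
```

After ablation the activation is orthogonal to the direction, so any downstream computation that reads only that component sees zero.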
